Abstract

The diversity of populations in domestic species offers great opportunities to study genome response to selection. The 
recently published Sheep HapMap dataset is a great example of characterization of the worldwide genetic diversity in 
sheep. In this study, we re-analyzed the Sheep HapMap dataset to identify selection signatures in worldwide sheep 
populations. Compared to previous analyses, we made use of statistical methods that (i) take account of the hierarchical 
structure of sheep populations, (ii) make use of linkage disequilibrium information and (iii) focus specifically on either recent 
or older selection signatures. We show that this allows pinpointing several new selection signatures in the sheep genome 
and distinguishing those related to modern breeding objectives and to earlier post-domestication constraints. The newly 
identified regions, together with the ones previously identified, reveal the extensive genome response to selection on 
morphology, color and adaptation to new environments. 
Introduction

Domestication of animals and plants has played a major role in 
human history. With the advance of high-throughput genotyping 
and sequencing technologies, the analysis of large datasets in 
domesticated species offers great opportunities to study genome 
evolution in response to phenotypic selection [1]. The sheep was 
one of the first grazing animals to be domesticated [2] in part due 
to its manageable size and an ability to adapt to different climates 
and diets with poor nutrition. A large variety of breeds with 
distinct morphology, coat color or specialized production (meat, 
milk or wool) were subsequently shaped by artificial selection. 
Since the release of the 50K SNP array [3], it is now possible to 
scan genetic diversity in sheep in order to detect loci that have 
been involved in these various adaptive selection events. The 
Sheep HapMap dataset, which includes 50K genotypes for 3000 
animals from 74 breeds with diverse world-wide origins, provides a 
considerable resource for deciphering the genetic bases of 
phenotype diversification in sheep. In the first analysis of this 
dataset [4], the authors looked for selection by computing a global 
Fst among the 74 breeds at all SNP in the genome. They 
identified 31 genome regions with extreme differentiation between 
breeds, which included candidate genes related to coat pigmen- 
tation, skeletal morphology, body size, growth, and reproduction. 
Further studies took advantage of the Sheep HapMap resource to 
detect genetic variants associated with pigmentation [5], fat 
deposition [6], or microphthalmia disease [7]. Another study [8]
performed a genome scan for selection focused on American
synthetic breeds, using an Fst approach similar to that in [4].

The 74 breeds of the Sheep HapMap dataset have a strong 
hierarchical structure, with at least 3 distinct differentiation levels: 
an inter-continental level (e.g. European breeds vs Asian breeds), 
an intra-continental level (e.g. Texel vs Suffolk European breeds), 
and an intra-breed level (e.g. German Texel vs Scottish Texel 
flocks). Recent studies [9-12] showed that, when applied to 
hierarchically structured data sets, Fst based genome scans for 
selection may lead to a large proportion of false positives (neutral 
loci wrongly detected as under selection) and false negatives 
(undetected loci under selection). Besides, the heterogeneity of 
effective population size among breeds implies that some breeds 
are more prone to contribute large locus-specific Fst values than 
others [10]. Apart from these statistical considerations, merging 
populations with various degrees of shared ancestry can limit our 
understanding of the selective process at detected loci. Indeed, the 
regions pointed out in [4] can be related to either ancient selection, 
such as at the poll locus, which has likely been under selection for
thousands of years, or fairly recent selection, such as at the myostatin locus,
which has been specifically selected in the Texel breed. But in
most situations the time scale of adaptation cannot be easily 
determined. 

Another limit of genome scans for selection based on single SNP 
Fst computations is that they do not sufficiently account for the 
very rich linkage disequilibrium information, even when the single 
SNP statistics are combined into windowed statistics. Recently, we 
proposed a new strategy to evaluate the haplotype differentiation 
between populations [13]. We showed that using this approach
greatly increases the detection power of selective sweeps from SNP
chip data, and also enables the detection of soft or incomplete sweeps.
These latter selection scenarios are particularly relevant in 
breeding populations, where selection objectives have likely varied
over time and where the traits under selection are often
polygenic.

In this study we provide a new genome scan for selection based 
on the Sheep HapMap dataset, where we distinguish selective 
sweeps within and between 7 broad geographical groups. The 
within group analysis aims at detecting recent selection events 
related to the diversification of modern breeds. It is based on the 
single marker FLK test [10] and on its haplotypic extension 
hapFLK [13]. The FLK test is an extension of the Lewontin and 
Krakauer (LK) test [14] that accounts for population size 
heterogeneity and for the hierarchical structure between populations.
Like the LK test, the FLK test computes a global Fst for each
SNP, but allele frequencies are first rescaled using a population 
kinship matrix F. This matrix, which is estimated from the 
observed genome wide data, measures the amount of genetic drift 
that can be expected, under neutral evolution, along all branches 
of the population tree. With this rescaling, allele frequency 
differences are typically down-weighted if they are obtained with 
small populations, or populations that diverged a long time ago. 
The between group analysis focuses on older selection events and 
is only based on FLK. Overall, we confirmed 19 of the 31 sweeps
discovered in [4], while providing more details about the past
selection process at these loci. We also identified 71 new selection
signatures, with candidate genes related to coloration, morphology 
or production traits. 
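
To make the rescaling idea concrete, here is a minimal sketch of the single-SNP FLK computation, assuming the form of the statistic given in [10]: the ancestral frequency is estimated as p0 = (1' F^-1 p) / (1' F^-1 1), and FLK = (p - p0 1)' [p0 (1 - p0) F]^-1 (p - p0 1). The function names `solve` and `flk` and the toy kinship matrices are illustrative assumptions, not part of the authors' software.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination
    with partial pivoting (pure Python, fine for small matrices)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def flk(p, F):
    """Return (p0, FLK) for one SNP.

    p: allele frequencies in the n populations.
    F: n x n population kinship matrix (assumed symmetric, invertible).
    """
    n = len(p)
    w = solve(F, [1.0] * n)                      # F^-1 1
    p0 = sum(wi * pi for wi, pi in zip(w, p)) / sum(w)
    d = [pi - p0 for pi in p]                    # deviations from p0
    x = solve(F, d)                              # F^-1 (p - p0 1)
    stat = sum(di * xi for di, xi in zip(d, x)) / (p0 * (1.0 - p0))
    return p0, stat


# Toy example: the third population carries the divergent frequency.
p = [0.3, 0.3, 0.7]
equal_drift = [[0.1, 0, 0], [0, 0.1, 0], [0, 0, 0.1]]
more_drift_in_third = [[0.1, 0, 0], [0, 0.1, 0], [0, 0, 0.4]]
_, stat_equal = flk(p, equal_drift)
_, stat_drifted = flk(p, more_drift_in_third)
```

In this toy example `stat_drifted` is smaller than `stat_equal`: the same allele frequency difference counts for less when the divergent population is expected to have drifted more, which is the down-weighting described above.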

Results and Discussion 

We detected selection signatures using methods that aim at 
identifying regions of outstanding genetic differentiation between 
populations, based either on single SNP, FLK [10], or haplotype, 
hapFLK [13], information. These methods have optimal power 
when working on closely related populations so we separately 
analyzed seven groups of breeds, previously identified as sharing 
recent common ancestry [4] and corresponding to geographical 
origins of breeds. Before performing genome scans for selection 
signatures, we studied the population structure of each group to 
identify outlier animals as well as admixed and strongly 
bottlenecked populations, using both PCA and model-based 
approaches [15,16]. hapFLK was found to be robust to 
bottlenecks or moderate levels of admixture, but these phenomena 
may affect the detection power so we preferred to minimize their 
influence by removing suspect animals or populations. Details of 
these corrections are provided in the Methods section. The final
composition of population groups is given in Table 1.

Overview of selected regions 

An overview of selection signatures on the genome across the 
different groups is plotted in Figure 1 and a detailed description is 
provided in Table 2. Detected regions were typically a few 
megabases long and included from 1 to 196 genes, with a median 
of 15 genes. However, in many regions strong functional candidate 
genes were found very close to the position with lowest p-value, 
typically among the two closest genes from this position. These 
genes are reported in Table 2, as well as a few other functional 
candidates with less statistical evidence but strong prior knowledge 
from the literature. We found 41 selection signatures with hapFLK 
and 26 with FLK, although we allowed a slightly higher false
discovery rate for FLK than hapFLK (10% vs 5%). This result was
consistent with a higher power for hapFLK than FLK, as already
shown in [13].
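
As an illustration of what controlling the false discovery rate at 5% or 10% means in practice, the sketch below converts a list of p-values into Benjamini-Hochberg adjusted values. This is an assumption made for illustration only: the exact q-value procedure used in the study is not described in this section, and the function name `benjamini_hochberg` is hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values
    # stay monotone in the rank of the raw p-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# SNPs with an adjusted value below the chosen threshold are reported.
example = [0.01, 0.04, 0.03, 0.005]
significant_at_5pct = [q <= 0.05 for q in benjamini_hochberg(example)]
```

With such a procedure, a threshold of 10% for FLK and 5% for hapFLK simply corresponds to two different cut-offs applied to the adjusted values of each test.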

Four regions were found with both the single SNP and the 
haplotype test and harbor strong candidate genes: NPR2, KIT, 
RXFP2 and EDN3 (Table 2). The overlap was thus small, 
illustrating that the two tests tend to capture different signals. In 
particular, hapFLK will fail to detect ancient selective sweeps, for 
which the mutation-carrying haplotype is small and not associated 
with many SNP on the chip. On the contrary, single SNP tests will 
fail to capture selective sweeps when a single SNP is not in high 
LD with the causal mutation. They will also fail if the selected 
mutation is only at intermediate frequency but is associated with a 
long haplotype, in contrast with hapFLK. 

Six regions were detected in more than one group of breeds. 
They all contained strong candidate genes (Table 2). Three of 
these genes are related to coat color (KIT, KITLG and MC1R), 
and could correspond to independent selection events (see 
discussion below). One region harbors a gene (RXFP2) for which 
polymorphisms have been shown to affect horn size and polledness 
in the Soay [17] and Australian Merino [18]. We detected this 
region in 4 different groups and in all of them the highest FLK 
value was found to be very close to RXFP2 (Figure S8 in File S1). 
This provides clear indication that selection in this region is related 
to RXFP2, consistent with previous selection signatures detected 
by comparing specifically horned and polled breeds (Figure 6 in 
[4]). However, we note that the signatures of selection in this 
region exhibit different patterns among groups. The signal is very 
narrow in the SWE and SWA groups, and is in fact not detected 
by the hapFLK test, whereas it affects a large genome region in the 
CEU group where it is detected by hapFLK. In the ITA group, 
the FLK statistics do not reach significance, and the hapFLK 
signal is not high (minimum q-value of 0.04). Overall, the selection 
signatures suggest that selection on RXFP2, most likely due to 
selection on horn phenotypes, was carried out worldwide at 
different times and intensities. Another region harbors the 
HMGA2 gene, involved in selection for stature in dogs [19] and 
associated with body size in horses [20] and height in humans [21].
The last region includes two interesting candidate genes: ABCG2,
which has been associated with a strong QTL for milk production in
cattle [22], and NCAPG, which has been associated with fetal
growth [23] and calving ease [24] in cattle and which is located in
several selection signatures in this species [25-28]. In our analysis,
populations with a selection signature in this region belong to three 
European groups (SWE, ITA and CEU) and our results suggest 
that selection in these different groups might involve distinct genes 
(Table 2). 

In the paper presenting the Sheep HapMap dataset [4], 31 
selection signatures were found, corresponding to the 0.1% highest
single SNP Fst. Using FLK and hapFLK, we confirmed
signatures of selection for 10 of these regions. Considering the 
two analyses were performed on the same dataset, this overlap can 
be considered as rather small. Two reasons can explain this. 

First, the previous analysis was based on the Fst statistic. 
Although this statistic is commonly used for selection scans, it is 
prone to produce false positives when the population tree harbors 
unequal branch lengths (i.e. unequal effective population sizes) 
[10]. In particular, strongly bottlenecked breeds will contribute 
high Fst values preferentially even under neutral evolution, 
because their smaller effective population size implies a larger 
variance of allele frequencies. With FLK and hapFLK, Fst values 
between populations are rescaled using branch lengths, so 
populations with long branch lengths will not contribute more 
than others [13]. In fact they will tend to contribute less, as the 
Table 1. Population groups from the Sheep HapMap dataset used for the detection of selection signatures.

Group | Abbreviation | Size | Populations (Abbreviations)
Africa | AFR | | Red Maasai (RMA), Ethiopian Menz (EMZ)
Asia | ASI | | Bangladeshi BGE (BGE), Bangladeshi Garole (BGA), Changthangi (CHA), Deccani (IDC), Garut (GUR), Indian Garole (GAR), Sumatra (SUM), Tibetan (TIB)
Central Europe | CEU | | Bundner Oberlander (BOS), Engadine Red (ERS), Valais Blacknose (VBS), Valais Red (VRS)
Italy | ITA | | Altamurana (ALT), Comisana (COM), Leccese (LEC), Sardinian Ancestral Black (SAB)
Northern Europe | NEU | | Galway (GAL), German (GTX), New Zealand (NTX) and Scottish (STX) Texel, Irish Suffolk (ISF), New Zealand Romney (NZR)
South West Asia | SWA | | Afshari (AFS), Moghani (MOG), Norduz (NDZ), Qezel (QEZ)
South West Europe | SWE | | Australian Merino (MER), Churra (CHU), Meat (LAM) and Milk (LAC) Lacaune

doi:10.1371/journal.pone.0103813.t001



statistical power to distinguish selective effects from drift effects is 
naturally lower in populations where drift is larger. 

Second, the previous analysis was performed using all breeds at 
the same time. It is therefore possible that some of these regions 
correspond to differentiation between groups of breeds rather than 
within groups. To investigate this question, we performed a 
genome scan for selection between seven virtual populations 
corresponding to the ancestors of the seven population groups. 
Allele frequencies in each of these ancestral populations were 
estimated from those observed in modern breeds and regions with 
outlying genetic differentiation between ancestral populations were 
detected using the FLK statistic [10]. For this analysis, we did not 
include SNP lying in regions detected within groups since selection 
biases their estimated ancestral allele frequencies. The ancestral 
population tree was reconstructed using SNP for which we have 
unambiguous ancestral allele information (Figure S9 in File S1). 
This tree is decomposed into two main lineages, one for European 
breeds and one for Asian and African breeds. The African group 
exhibits a slightly higher branch length. We note, however, that 
this could be due to ascertainment bias of SNP on the SNP array. 

This led to the identification of 23 new selection signatures 
(Figure 2 and Table 3), 9 of them being common to the analysis of
[4]. Overall, combining the scans for recent and ancestral
selection, we failed to replicate 12 of the regions in [4].

Selection Signatures within population groups 

Coloration. Many selection signatures are located around 
genes that have been shown to be involved in hair, eye or skin 
color. In particular, several detected regions include candidate 
genes that are involved in the development and migration of 
melanocytes and in pigmentation: EDN3, KIT, KITLG, MC1R 
and MITF. For all these genes except MITF, we have quite strong 
evidence that they are the genes targeted by selection in the 
detected region. In the SWA group, EDN3 was included in the 
detected region for both FLK and hapFLK, and in both cases it 
was the closest gene to the highest test value. KIT and KITLG 
were both included in a detected region (with relatively few genes) 
for two different geographical groups, and were very close to the 
position with the smallest p-value in one of those. MC1R was also 
in a detected region for two different groups, NEU and ITA. In 
the two cases it was not very close to the maximum of the signal, 
but we note that the black skin or coat color is an important 
characteristic of the two populations that have been found under 
selection in this region, the Irish Suffolk and Sardinian Ancestral 
Figure 1. Localization of selection signatures identified in 7 groups of populations. Candidate genes are indicated above their genomic
localization. Only chromosomes harboring selection signatures are plotted. Population groups in the legend: AFR, ASI, CEU, ITA, NEU, SWA, SWE.
doi:10.1371/journal.pone.0103813.g001



Black. This observation, together with the fact that MC1R
mutations are responsible for coat color patterns in mammals (e.g.
in cattle [29]), supports the hypothesis that MC1R is a good 
candidate for the signatures we observed. 

Although not listed in Table 2, SOX10 and ASIP, two other
genes implicated in pigmentation, also show some evidence of
selection. In the ITA group, the q-value of hapFLK near SOX10
is 6.2% and almost reaches the significance threshold of 5%.
Similarly, the two closest SNP to ASIP (s66432 and s12884)
present suggestive FLK p-values of respectively $7.5 \times 10^{-4}$ and
$6.8 \times 10^{-5}$ in the ASI group, and one (s12884) is significantly
differentiated between the ancestral groups. All these genes have 
previously been reported as being likely selection targets and/or 
associated to color patterns in different mammalian species. 



Finally, we found a signal for selection centered on the BNC2 
gene, that has recently been associated with skin pigmentation in
humans [30]. All population groups present at least one selection
signature which is very likely related to one of the above genes, 
reflecting the widespread importance of color patterns to define 
sheep breeds. 

Inferring a precise history of underlying causal mutations for 
color patterns in this dataset is hard for several reasons: the precise 
phenotypic characterizations of coat color patterns in the Sheep 
HapMap breeds are not available; the 50K SNP array used does 
not offer sufficient density to associate a given selection signature 
to a specific set of polymorphisms; finally, from the literature it 
appears that a large number of genes and mutations can be 
considered a priori as potentially causal for a given pigmentation 
Figure 2. Genome scan for selection signature in ancestral populations of the geographical groups. Significant SNP at the 5% FDR level
are plotted in darker color.
pattern. In particular, mutations in different genes can give rise to
the same phenotype (e.g. in horses [31]). Also, within a gene
different mutations can give rise to different phenotypes, e.g.
mutations in the MC1R gene (also named the extension locus)
have been associated with a large panel of skin or coat colors
[29,32,33]. Deciphering selection signatures related to coat color 
in sheep and in particular identifying the causal variants under 
selection will require sequencing these genes for individuals from 
several breeds with diverging color patterns. This in turn will help 
to understand the evolutionary history of the breeds and the effect 
of selection [34]. To potentially help in this task, in Table S1 in
File S1 we list, for each "color gene", the populations that have
likely been selected for. 

Morphology. Another group of genes that are found within 
selection signatures have known effects on body morphology and 
development. NPR2, HMGA2 and BMP2, pointed out previously 
[4] are confirmed as good positional candidates by our study. We 
also found strong evidence for selection on WNT5A, ALX4 or 
EXT2, and two HOX gene clusters (HOXA and HOXC). 
WNT5A and ALX4 are two genes involved in the development of 
the limbs and skeleton. Mutations in WNT5A cause the
dominant human Robinow syndrome, characterized by short
stature, limb shortening, genital hypoplasia and craniofacial
abnormalities [35]. ALX4 loss of function mutations cause
polydactyly in the mouse, through dysregulation of the sonic
hedgehog (SHH) signaling factor [36,37]. Moreover, the ALX4
protein has been shown to bind proteins from the HOXA
(HOXA11 and HOXA3) and HOXC (HOXC4 and HOXC5)
clusters [38]. Located just beside ALX4 and corresponding to the
same selection signature, EXT2 is responsible for the development
of exostoses in the mouse [39]. HOX genes are responsible for
antero-posterior development and skeletal morphology along the 
anterior-posterior axis in vertebrates. The selection signature 
around HOXA is a recent selection signature in the SWA group, 
while that around HOXC is an ancestral signature with a high 
differentiation of the ASI ancestor compared to AFR and SWA 
(Table 3). 

Finally, we note that an ancestral selection signature is found 
near the ACAN gene, whose expression was shown to be
upregulated by BMP2 [40], another candidate gene for selection.
Three genes within the selection signature are found closer to the 
maximum test value than ACAN, but these are in silico predicted 
genes, whose protein coding function has not been confirmed, so 
ACAN seems to be overall a better candidate for explaining 
selection in the region. Mutations in the ACAN gene have been 
shown to induce osteochondrosis [41] and skeletal dysplasia [42]. 
The ACAN region has also been shown to be associated with 
height in humans [43]. 

Traits of agronomic importance. Sheep have been raised 
for meat, milk and wool production. Under selection signatures, 
we found several genes associated with these production traits. In 
addition to the selection signature in Texels on the MSTN gene 
for increased muscularity [44], discussed in [13], we detected a 
selection signature centered on HDAC9 and including few other 
genes, which could also be linked to muscling. HDAC9 is a known 
transcriptional repressor of myogenesis. Its expression has been 
shown to be affected by the callipyge mutation in the sheep at the
DLK1-DIO3 locus [45]. The signature around HDAC9 corresponds
to a selection signature in the Garut breed from Indonesia,
a breed used in ram fights. As already discussed, one selection 
signature contains ABCG2, a gene underlying a QTL with large 
effects on milk production (yield and composition) in cattle [22]. 
Also, one of the ancestral selection signatures reaches its maximum 
value close to the INSIG2 gene, recently shown to be associated 
with milk fatty acid composition in Holstein cattle [46]. Two 
selection signatures could be related to wool characteristics, one in 
the CEU group including the FGF5 gene, partly responsible for 
hair type in the domestic dog [47,48], and an ancestral selection 
signature on chromosome 25 in a QTL region associated with wool
quality traits in the sheep [49,50].

One of the strong outlying regions in the selection scan contains 
the PITX3 gene. Further analysis revealed that this signature was 
due to the German Texel population haplotype diversity differing 
from the other Texel samples (results not shown). It turns out that 
the German Texel sample consisted of a case/control study for
microphthalmia [7], although the case/control status information in
this sample is not given in the Sheep HapMap dataset. The 
consequence of such a recruitment is to bias haplotype frequencies 
in the region associated with the disease, which provokes a very 
strong differentiation signal between the German Texel and the 
other Texel populations. Although not related to artificial or 
natural selection in sheep, this signature illustrates that our method 
for detecting selection has the potential to identify causal variants 
in case/control studies, while using haplotype information. 

Ancestral signatures of selection 

For ancestral selection signatures, i.e. the regions showing 
outlying genetic differentiation between population groups, it is 
difficult to estimate how far back in time selection occurred. In 
particular, it would be interesting to place the divergences shown 
by the ancestral population tree with respect to sheep domestica- 
tion. Two interesting candidate genes for ancestral selection 
signatures might indicate that the selection signatures captured 
could be rather old. First, we found selection near the TRPM8 
gene, which has been shown to be a major determinant of cold 
perception in the mouse [51]. The pattern of allele frequency at 
the significant SNP (Table 3) is consistent with the climate in the 
geographical origins of the population groups. AFR, ASI and ITA, 
living in warm climates, have low frequency (0.04-0.16) of the A 
allele, while NEU and CEU, from colder regions, have higher 
frequencies (0.55-0.7), the SWE group having an intermediate 
frequency of 0.38. Overall, this selection signature might be due to 
an adaptation to cold climate through selection on a TRPM8 
variant. Another selection signature lies close to a potential 
chicken domestication gene, TSHR [52], whose signaling regu- 
lates photoperiodic control of reproduction [53]. This selection 
signature was identified before [4] and our analysis indicates that 
selection happened before the divergence of breeds within 
geographic groups, consistent with an early selection event. Given 
its role, we can speculate that selection on the TSHR gene is 
related to seasonality of reproduction. Under temperate climates, 
sheep experience a reproductive cycle under photoperiodic 
control. Furthermore, there is evidence that this control was 
altered during domestication [54] so our analysis suggests genetic 
mutations in TSHR may have contributed to this alteration. 

As discussed above, some of the genes found underlying 
ancestral selection signatures can be related to production or 
morphological traits (e.g. ASIP, INSIG2, ACAN, wool QTL), 
indicating that these traits have likely been important at the 
beginning of sheep history. The other genes that we could identify 
as likely selection targets in the ancestral population tree relate to 
immune response (GATA3) and in particular to antiviral response 
(TMEM154 [55], TRAF3 [56]). The most significant ancestral 
selection signature is centered around the NF1 gene, encoding 
neurofibromin. This gene is a negative regulator of the ras signal 
transduction pathway, therefore involved in cell proliferation and 
cancer, in particular neurofibromatosis. Due to this central role in 
intra-cellular signaling, mutations affecting this gene can have 
many phenotypic consequences so that its potential role in the 
adaptation of sheep breeds remains unclear. 

Conclusions 

The Sheep HapMap dataset is an exceptional resource for 
sheep genetics studies. In a population genomics context, our study 
shows that the rich information contained in these data makes it possible to
start unraveling the genetic history of sheep populations worldwide.
In order to fully exploit this information, we used recent
statistical approaches that account for the relationship between 
populations and the linkage disequilibrium patterns (haplotype 
diversity). This allowed detecting with confidence more selection 
signatures and identifying for most of them the selected
populations. Among these new selection signatures detected by
our study, several result from recent selection and include good 
positional candidate genes with functions related to pigmentation 
(KITLG, EDN3), morphology (WNT5A, ALX4, EXT2, HOXA 
cluster) or production traits (HDAC9). Two ancestral selection 
signatures are also of particular interest as they harbor genes 
(TRPM8 and TSHR) whose functions (cold and photoperiodic 
perception respectively) seem highly relevant to the selection 
response during the early history of sheep domestication. 

With information on adaptive genome regions and selected 
populations, we hope that our work will foster new studies to 
unravel the underlying biological mechanisms involved. To this 
aim, it is likely that further phenotypic and genetic data are 
required. On the genetics side, even though the SNP array used in 
this study was sufficient to localize genome regions harboring 
adaptive mutations, its density and the SNP ascertainment bias 
resulting from its design did not allow us to tag the causative 
mutation precisely. Elucidating the causal variation underlying 
selection signatures will thus most likely require large scale 
sequencing data. 

Genome scans for selection, including this one, are identifying 
regions that are outliers from a statistical model and do not require
specifying an alternative hypothesis based on phenotypic records.
While this can be seen as an advantage for the initial localization 
of genome regions, it is a limitation for the identification of 
biological processes involved. Gathering phenotypic records in 
specific populations, in particular for color and morphology traits, 
will be needed to go further. 

Methods 

Selecting populations and animals. Seventy-four breeds 
are represented in the Sheep HapMap data set, but we only used a 
subset of these breeds in our genome scan. We removed the breeds 
with small sample size (< 20 animals), for which haplotype 
diversity cannot be determined with sufficient precision. Based on 
historical information, we also removed all breeds resulting from a 
recent admixture or having experienced a severe recent bottle- 
neck. Focusing on the remaining breeds, we then studied the 
genetic structure within each population group, in order to detect 
further admixture events. We performed a standardized PCA of 
individual based genotype data and applied the admixture 
software [16]. 

In two population groups (AFR and NEU) the different breeds 
were clearly separated into distinct clusters of the PCA and showed 
no evidence of recent admixture (Figures S1 and S2 in File S1). 
These samples were left unchanged for the genome scan for 
selection. A similar pattern was observed in three other groups 
(ITA, SWA, ASI), except for a few outlier animals that had to be
re-attributed to a different breed or simply removed (Figures S3,
S4 and S5 in File S1). In the two last groups (CEU and SWE),
several admixed breeds were found and were consequently
removed from the genome scan analysis (Figures S6 and S7 in
File S1).

We performed a genome scan within each group of populations 
listed in Table 1, with a single SNP statistic FLK [10] and its 
haplotype version hapFLK [13]. 

Population trees. Both statistics require estimating the 
population tree, with a procedure described in detail in [10]. 
Briefly, we built a population tree for each group by first 
calculating Reynolds' distances between each population pair, and 
then applying the Neighbor Joining algorithm on the distance 
matrix. For each group, we rooted the tree using the Soay sheep as 
an outgroup. This breed has been isolated on an island for many 
generations and exhibits a very strong differentiation with all the 
breeds of the Sheep HapMap dataset, making it well suited to be 
used as an outgroup. 
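
The first step of this procedure can be sketched as follows. The sketch assumes biallelic SNPs and the usual large-sample form of Reynolds' estimator, in which the coancestry coefficient is a ratio of sums over loci and the distance is D = -ln(1 - theta). The function names and the toy frequencies are illustrative assumptions, and the Neighbor Joining step that would be applied to the resulting matrix is not shown.

```python
import math


def reynolds_distance(freqs_a, freqs_b):
    """Reynolds' distance between two populations.

    freqs_a, freqs_b: per-SNP frequencies of the reference allele,
    in the same SNP order for both populations.
    """
    num = 0.0
    den = 0.0
    for pa, pb in zip(freqs_a, freqs_b):
        # Both alleles of a biallelic SNP contribute to each sum.
        num += (pa - pb) ** 2 + ((1 - pa) - (1 - pb)) ** 2
        den += 1.0 - (pa * pb + (1 - pa) * (1 - pb))
    theta = num / (2.0 * den)
    return -math.log(1.0 - theta)


def distance_matrix(freqs):
    """Pairwise distance matrix from a dict {population: frequencies}."""
    names = sorted(freqs)
    matrix = [
        [0.0 if a == b else reynolds_distance(freqs[a], freqs[b])
         for b in names]
        for a in names
    ]
    return names, matrix


# Toy allele frequencies for three hypothetical populations.
toy = {
    "A": [0.2, 0.5, 0.9],
    "B": [0.3, 0.4, 0.8],
    "C": [0.7, 0.1, 0.3],
}
names, dist = distance_matrix(toy)
```

Here populations A and B have similar frequencies and end up much closer to each other than either is to C, which is the information the tree-building step relies on.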

FLK and hapFLK genome scans. The FLK statistic was 
computed for each SNP within each group. The evolutionary 
model underlying the FLK statistic assumes that SNP were already 
polymorphic in the ancestral population. To consider only loci 
that most likely match this hypothesis, we restricted our analysis 
within each group to SNP for which estimated ancestral minor
allele frequency $p_0$ was above 5%. Under neutrality, the FLK
statistic should follow a $\chi^2$ distribution with $n - 1$ degrees of
freedom (DF), where $n$ is the number of populations in the group.
Overall, the fit of the theoretical distribution to the observed
distribution was very good (Text S1 in File S1) with the mean of
the observed distribution ($\overline{FLK}$) being very close to $n - 1$ (Table S3
in File S1). Using $\overline{FLK}$ as DF for the $\chi^2$ distribution provided a
better fit to the observed data than the $n - 1$ theoretical value. We
thus computed FLK p-values using the $\chi^2(\overline{FLK})$ distribution. To
compute the hapFLK statistic, we used the Scheet and Stephens
LD model [57], a mixture model for haplotypes which requires 
specifying a number of haplotype clusters to be used. To choose 
this number, for each group, we used the fastPHASE 
cross-validation-based estimate of the optimal number of clusters. The 
results of this estimation are given in Table S2 in File S1. The LD 
model was estimated on unphased genotype data. The hapFLK 
statistic is computed as an average over 20 runs of the EM 
algorithm to fit the LD model. As in [13], we found that the 
hapFLK distribution could be modeled relatively well with a 
normal distribution (corresponding to non-outlying regions) and a 
few outliers; we used robust estimation of the mean and standard 
deviation of the hapFLK statistic to eliminate the influence of 
outlying (i.e. potentially selected) regions. This procedure was carried out 
within each group; the resulting mean and standard deviation 
values are given in Table S2 in File S1. Finally, we 
computed at each SNP a p-value for the null hypothesis from the 
normal distribution. 
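
The two p-value computations described above can be sketched as follows. This is a minimal sketch, not the code used for the analysis. It makes two assumptions that the text does not specify: the robust mean and standard deviation are replaced by the median and the scaled median absolute deviation (MAD), and the hapFLK p-value is taken as the upper tail of the normal distribution. All function names are hypothetical. The chi-square tail is written out with the standard series and continued-fraction expansions of the incomplete gamma function so that the example has no dependencies; in practice a library routine such as `scipy.stats.chi2.sf` would normally be used.

```python
import math
from statistics import NormalDist, median


def chi2_sf(x, df):
    """Upper tail P(X > x) of a chi-square with (possibly non-integer) df."""
    a, z = df / 2.0, x / 2.0
    if z <= 0:
        return 1.0
    log_pref = -z + a * math.log(z) - math.lgamma(a)
    if z < a + 1:
        # Series expansion of the lower regularized incomplete gamma.
        n, term = a, 1.0 / a
        total = term
        while abs(term) > abs(total) * 1e-15:
            n += 1
            term *= z / n
            total += term
        return 1.0 - total * math.exp(log_pref)
    # Continued fraction (modified Lentz) for the upper incomplete gamma.
    tiny = 1e-300
    b = z + 1 - a
    c, d = 1 / tiny, 1 / b
    h = d
    for i in range(1, 500):
        an = -i * (i - a)
        b += 2
        d = an * d + b
        if abs(d) < tiny:
            d = tiny
        c = b + an / c
        if abs(c) < tiny:
            c = tiny
        d = 1 / d
        delta = d * c
        h *= delta
        if abs(delta - 1) < 1e-15:
            break
    return h * math.exp(log_pref)


def flk_pvalues(flk_values):
    """FLK p-values with DF set to the observed mean of FLK."""
    df = sum(flk_values) / len(flk_values)
    return [chi2_sf(v, df) for v in flk_values]


def hapflk_pvalues(hapflk_values):
    """Upper-tail normal p-values with robust location and scale.

    Assumption: median and 1.4826 * MAD stand in for the robust
    estimates of the mean and standard deviation.
    """
    center = median(hapflk_values)
    mad = median(abs(v - center) for v in hapflk_values)
    null = NormalDist(center, 1.4826 * mad)
    return [1.0 - null.cdf(v) for v in hapflk_values]
```

Because the location and scale are estimated robustly, a handful of very large hapFLK values barely changes the fitted null distribution, which is the purpose of the robust step.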

Selection in ancestral groups. The within-group FLK 
analysis provides for each SNP an estimate of the allele 
frequency $p_0$ in the population ancestral to all populations of the 
group. We used this information to test SNPs for selection using 
between-group differentiation, with some adjustments. First, the 
FLK model assumes tested polymorphisms are present in the 
ancestral population. SNPs for which the alternate allele has been 
seen in only one population group are likely to have appeared after 
divergence (within the ancestral tree) and were therefore removed 
from the analysis. Second, regions selected within groups affect 
allele frequency in some breeds and therefore bias our estimation 
of the ancestral allele frequency in this group. We therefore 
removed all SNPs that were included in within-group selection 
signatures. Finally, the FLK test requires a rooted population tree. 
For the within-group analysis, we could use a population very 
distant from the current breeds (the Soay sheep). For the 
ancestral tree, we created an outgroup homozygous for ancestral 
alleles at all SNPs. 
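
The two SNP filters applied before the ancestral scan can be sketched as below. This is an illustration only: the input layout (a mapping from SNP to per-group alternate allele frequency) and the function name are hypothetical, and "seen" is taken here to mean a frequency above zero in the group.

```python
def snps_for_ancestral_scan(alt_freq_by_group, within_group_hits):
    """Select SNPs eligible for the between-group (ancestral) FLK test.

    alt_freq_by_group: {snp_id: {group: alternate allele frequency}}
    within_group_hits: set of snp_ids lying in within-group signatures
    Both structures are hypothetical, for illustration only.
    """
    kept = []
    for snp, by_group in alt_freq_by_group.items():
        # Filter 1: alternate allele must be seen in at least two groups.
        groups_with_alt = sum(1 for f in by_group.values() if f > 0)
        if groups_with_alt < 2:
            continue
        # Filter 2: drop SNPs inside within-group selection signatures.
        if snp in within_group_hits:
            continue
        kept.append(snp)
    return kept
```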

Identifying selected regions and candidate genes. We 
defined significant regions for each statistic and within each group 
of populations. Using the neutral distribution ($\chi^2$ for FLK and 
normal for hapFLK), we computed the p-value of each statistic at 
each SNP. To identify selected regions, we estimated the corresponding q-values 
[58] to control the false discovery rate (FDR). For FLK, SNPs with a q-value below 0.1 
were considered significant, which by definition implies that we 
expect 10% of false positives among our detected SNPs. Since the 
power of hapFLK is greater than that of FLK [13], we used a q- 
value threshold of 0.05, therefore controlling FDR at the 5% level. 
For the FLK analysis in ancestral populations, we used an FDR 
threshold of 5%. 
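
To make the thresholding step concrete, the sketch below converts p-values to FDR-adjusted values and applies a cutoff. It uses the Benjamini-Hochberg adjustment as a stand-in: this is not the q-value estimator of [58], which additionally estimates the proportion of true null hypotheses, so it should be read as a conservative approximation. The function name is hypothetical.

```python
def bh_adjusted(pvalues):
    """Benjamini-Hochberg adjusted p-values (stand-in for q-values)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


pvals = [0.01, 0.04, 0.03, 0.5]
q = bh_adjusted(pvals)
significant = [v < 0.05 for v in q]  # 5% threshold, as used for hapFLK
```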

We then aimed at identifying genes that seem good candidates 
for explaining selection signatures. We proceeded differently for 
the single-SNP FLK and hapFLK. For FLK, we considered that 
significant SNPs less than 500 kb apart were capturing the same 
selection signal. Then, we considered as potential candidate genes 
any gene lying within 1 Mb of any significant SNP. For 
hapFLK, the genome signal is much more continuous than single 
SNP tests, because the statistic captures multipoint LD with the 
selected mutations. A consequence is that the significant regions 
can span large chromosome intervals. To restrict the list of 
potential candidate genes, and target only the ones closest to the 
most significant SNP, we restricted our search to the part of the 
signal where the difference in hapFLK value with the most 
significant SNP was less than $0.5\sigma$. This allowed taking into 
consideration the profile of the hapFLK signal, i.e. if the profile 
resembles a plateau, the candidate region will be rather broad 
while very sharp hapFLK peaks will provide a narrower candidate 
region. We extracted all protein-coding genes present in the 
significant regions using the Ensembl Biomart tool 
(http://www.ensembl.org/biomart/) for the Ovis aries 3.1 genome assembly. 
These full lists are provided as Supporting Information (Dataset S1 
and Dataset S2). Within each candidate region, genes were ranked 
according to their distance from the most significant position of the 
region (the larger the rank, the larger the distance). The functional 
candidate genes shown in Table 2 and discussed in the manuscript 
were chosen based on this rank and/or on their implication in 
previous association or sweep detection studies. 
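
The two region definitions described above can be sketched as follows. This is a hedged illustration with hypothetical function names, assuming SNPs on a single chromosome with positions in base pairs. For hapFLK, the sketch assumes the candidate interval is the contiguous stretch around the peak, and the allowed drop from the peak value is passed in by the caller.

```python
def merge_significant_snps(positions, max_gap=500_000):
    """Group significant SNPs less than max_gap bp apart into regions."""
    regions = []
    for pos in sorted(positions):
        if regions and pos - regions[-1][1] < max_gap:
            regions[-1][1] = pos          # extend the current region
        else:
            regions.append([pos, pos])    # start a new region
    return [tuple(r) for r in regions]


def hapflk_core_interval(positions, values, max_drop):
    """Contiguous interval around the peak where peak - value < max_drop.

    positions and values are parallel lists ordered along the chromosome.
    """
    peak = max(range(len(values)), key=values.__getitem__)
    lo = hi = peak
    while lo > 0 and values[peak] - values[lo - 1] < max_drop:
        lo -= 1
    while hi < len(values) - 1 and values[peak] - values[hi + 1] < max_drop:
        hi += 1
    return positions[lo], positions[hi]
```

A plateau-shaped profile yields a wide interval because many neighbouring SNPs stay close to the peak value, whereas a sharp peak yields a narrow one.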